Search results for "Suffix array"

showing 10 items of 16 documents

Suffix Array Construction on Multi-GPU Systems

2019

Suffix arrays are prevalent data structures being fundamental to a wide range of applications including bioinformatics, data compression, and information retrieval. Therefore, various algorithms for (parallel) suffix array construction both on CPUs and GPUs have been proposed over the years. Although providing significant speedup over their CPU-based counterparts, existing GPU implementations share a common disadvantage: input text sizes are limited by the scarce memory of a single GPU. In this paper, we overcome aforementioned memory limitations by exploiting multi-GPU nodes featuring fast NVLink interconnects. In order to achieve high performance for this communication-intensive task, we …

Multi-core processorSpeedupComputer scienceSuffix array0102 computer and information sciences02 engineering and technologyParallel computingData structure01 natural scienceslaw.inventionCUDAShared memory010201 computation theory & mathematicslaw0202 electrical engineering electronic engineering information engineering020201 artificial intelligence & image processingSuffixData compressionProceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
researchProduct

Suffixes, Conjugates and Lyndon Words

2013

In this paper we are interested in the study of the combinatorial aspects connecting three important constructions in the field of string algorithms: the suffix array, the Burrows-Wheeler transform (BWT) and the extended Burrows-Wheeler transform (EBWT). Such constructions involve the notions of suffixes and conjugates of words and are based on two different order relations, denoted by $\plex$ and $\pom$, that, even if strictly connected, are quite different from the computational point of view. In this study an important role is played by Lyndon words. In particular, we improve the upper bound on the number of symbol comparisons needed to establish the $\pom$ order between two primitive wo…

MultisetReduction (recursion theory)BWT; Lyndon factorization; Suffix ArrayString (computer science)Suffix arrayLyndon words Lyndon factorization BWT Suffix array EBWT Circular words ConjugacyLexicographical orderlaw.inventionSuffix ArrayCombinatoricsBWTLyndon factorizationlawOrder (group theory)Symbol (formal)Word (group theory)Mathematics
researchProduct

Sorting suffixes of a text via its Lyndon Factorization

2013

The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows-Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can b…

FOS: Computer and information sciencesBWTLyndon FactorizationSettore INF/01 - InformaticaSorting Suffixes; Lyndon Factorization; Lyndon WordsSuffix arrayComputer Science - Data Structures and AlgorithmsData_FILESData Structures and Algorithms (cs.DS)Lyndon wordSorting suffixeSorting SuffixesLyndon Words
researchProduct

Computing the Original eBWT Faster, Simpler, and with Less Memory

2021

Mantaci et al. [TCS 2007] defined the \(\mathrm {eBWT}\) to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. However, since this introduction, it has been used more generally to describe any \(\mathrm {BWT}\) of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original \(\mathrm {eBWT}\), which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the \(\mathrm {BWT}\) of a single string that uses …

2019-20 coronavirus outbreakSpeedupString collectionsBig BWTSettore INF/01 - InformaticaSevere acute respiratory syndrome coronavirus 2 (SARS-CoV-2)String (computer science)Suffix arrayOrder (ring theory)omega-orderQuantitative Biology::GenomicsBurrows-Wheeler-TransformBurrows-Wheeler-Transform String collections SAIS Big BWT prefix-free parsing extended BWTlaw.inventionCombinatoricsprefix-free parsingSimple (abstract algebra)lawSAISSAIS algorithmIndependence (probability theory)extended BWTMathematics
researchProduct

r-Indexing the eBWT

2021

The extended Burrows Wheeler Transform (\(\mathrm {eBWT}\)) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. In our prior work [SPIRE 2021], we give a linear-time algorithm for the \(\mathrm {eBWT}\) that preserves the fundamental property of the original definition (i.e., the independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the \(\mathrm {eBWT}\), i.e., run-length encoded \(\mathrm {eBWT}\) and \(…

Physicsstring compressionBurrows–Wheeler transformSettore INF/01 - InformaticaSearch engine indexingSuffix arrayOrder (ring theory)Burrows-Wheeler-Transform r-index string compression extended BWT compressed indexingBurrows-Wheeler-Transformlaw.inventionCombinatoricsr-indexcompressed indexinglawIndexingextended BWT
researchProduct

On-line construction of two-dimensional suffix trees

1997

We present a new technique, which we refer to as implicit updates, based on which we obtain: (a) an algorithm for the on-line construction of the Lsuffix tree of an n x n matrix A — this data structure, described in [13], is the two-dimensional analog of the suffix tree of a string; (b) simple algorithms implementing primitive operations for LZ1-type on-dine lossless image compression methods. Those methods, recently introduced by Storer [35], are generalizations of LZl-type compression methods for strings (see also [24, 31]). For the problem in (a), we get nearly an order of magnitude improvement over algorithms that can be derived from known techniques [13]. For the problem in (b), we do …

CombinatoricsSuccinct data structureCompressed suffix arrayTree (data structure)Computer sciencelawSuffix treeString (computer science)Generalized suffix treeSuffixData compressionlaw.invention
researchProduct

Efficient Algorithms for Sequence Analysis with Entropic Profiles

2017

Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign …

0301 basic medicineCompressed suffix arrayTheoretical computer scienceEntropySuffix tree0206 medical engineeringGeneralized suffix tree02 engineering and technologyString searching algorithmInformation theorylaw.invention03 medical and health scienceslawGeneticsAnimalsHumansMathematicsApplied MathematicsSuffix arrayComputational BiologyDNASequence Analysis DNAData structure030104 developmental biologySuffixAlignment free Entropy Sequence analysis Sequence comparisonAlgorithms020602 bioinformaticsBiotechnologyIEEE/ACM Transactions on Computational Biology and Bioinformatics
researchProduct

Lightweight LCP construction for very large collections of strings

2016

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is reali…

FOS: Computer and information sciencesComputer scienceComputation0102 computer and information sciences02 engineering and technologyParallel computing01 natural sciencesGeneralized Suffix ArrayTheoretical Computer Sciencelaw.inventionlawComputational Theory and MathematicComputer Science - Data Structures and AlgorithmsExtended Burrows-Wheeler TransformData_FILES0202 electrical engineering electronic engineering information engineeringDiscrete Mathematics and CombinatoricsData Structures and Algorithms (cs.DS)Discrete Mathematics and CombinatoricAuxiliary memoryLongest Common Prefix Array; Extended Burrows-Wheeler Transform; Generalized Suffix Array;String (computer science)LCP arraySuffix arrayData structureComputational Theory and Mathematics010201 computation theory & mathematicsLongest Common Prefix Array020201 artificial intelligence & image processingJournal of Discrete Algorithms
researchProduct

Linear-size suffix tries

2016

Suffix trees are highly regarded data structures for text indexing and string algorithms [MCreight 76, Weiner 73]. For any given string w of length n = | w | , a suffix tree for w takes O ( n ) nodes and links. It is often presented as a compacted version of a suffix trie for w, where the latter is the trie (or digital search tree) built on the suffixes of w. Here the compaction process replaces each maximal chain of unary nodes with a single arc. For this, the suffix tree requires that the labels of its arcs are substrings encoded as pointers to w (or equivalent information). On the contrary, the arcs of the suffix trie are labeled by single symbols but there can be Θ ( n 2 ) nodes and lin…

Compressed suffix arrayGeneral Computer ScienceSuffix tree[INFO.INFO-DS]Computer Science [cs]/Data Structures and Algorithms [cs.DS]Generalized suffix tree0102 computer and information sciences02 engineering and technologyData_CODINGANDINFORMATIONTHEORYText indexing01 natural sciencesY-fast trielaw.inventionLongest common substring problemTheoretical Computer ScienceCombinatoricsSuffix treelawFactor and suffix automata0202 electrical engineering electronic engineering information engineeringData_FILESArithmeticFactor and suffix automata; Pattern matching; Suffix tree; Text indexing; Theoretical Computer Science; Computer Science (all)Pattern matchingMathematicsSettore INF/01 - InformaticaX-fast trieComputer Science (all)LCP array010201 computation theory & mathematics020201 artificial intelligence & image processingFM-index
researchProduct

Parallel and Space-Efficient Construction of Burrows-Wheeler Transform and Suffix Array for Big Genome Data

2016

Next-generation sequencing technologies have led to the sequencing of more and more genomes, propelling related research into the era of big data. In this paper, we present ParaBWT, a parallelized Burrows-Wheeler transform (BWT) and suffix array construction algorithm for big genome data. In ParaBWT, we have investigated a progressive construction approach to constructing the BWT of single genome sequences in linear space complexity, but with a small constant factor. This approach has been further parallelized using multi-threading based on a master-slave coprocessing model. After gaining the BWT, the suffix array is constructed in a memory-efficient manner. The performance of ParaBWT has b…

0301 basic medicineTheoretical computer scienceBurrows–Wheeler transformComputer scienceGenomicsData_CODINGANDINFORMATIONTHEORYParallel computingGenomelaw.invention03 medical and health scienceslawGeneticsHumansEnsemblMulti-core processorApplied MathematicsLinear spaceSuffix arrayChromosome MappingHigh-Throughput Nucleotide SequencingGenomicsSequence Analysis DNA030104 developmental biologyAlgorithmsBiotechnologyReference genomeIEEE/ACM Transactions on Computational Biology and Bioinformatics
researchProduct